There are a number of topic model implementations in R. We will go
through one of the more popular ones – topicmodels – which
plays very nicely with the tidyverse.
topicmodels’s main topic model function is
LDA(), which stands for Latent Dirichlet Allocation, a type
of topic model whose name is often used as shorthand for topic models in general.
It takes a DTM as input and returns an object of class LDA,
which we can then analyze and visualize in the tidyverse.
options(stringsAsFactors = FALSE)
library(tidyverse)
library(tidytext)
library(topicmodels)
# read the prepared data frame into R as normal
nobel_tidy <- read_rds("data/nobel_stemmed.Rds") %>%
  select(Year, Laureate, words = word_stem)

# transform the data frame to a DTM
nobel_dtm <- nobel_tidy %>%
  group_by(Year) %>%
  count(words, sort = TRUE) %>%
  cast_dtm(Year, words, n)
There are many points where we can customize and adjust parameters, but the one we must specify is the number of topics, k. This is something that often takes some fiddling. Unless you have reason to think that the number of topics in a certain corpus is extremely limited, one generally uses somewhere between ~15 and 50 topics (very roughly).
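One rough way to guide the choice is to fit models for several candidate values of k and compare their fit, for example with topicmodels’ perplexity() (lower is better). A sketch, assuming the nobel_dtm object built above; the candidate values here are arbitrary:

```r
library(topicmodels)

# Fit one model per candidate k and compare perplexity on the
# training data (lower is better). This is only a heuristic:
# the best-fitting k is not always the most interpretable one.
candidate_k <- c(10, 15, 20, 30)
fits <- lapply(candidate_k, function(k) LDA(nobel_dtm, k = k))
data.frame(k = candidate_k, perplexity = sapply(fits, perplexity))
```

Treat the result as one input among several – interpretability of the topics still matters most.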
Another parameter it makes sense to think about before, or during, analysis is document size. As we’ve seen, a DTM breaks up a text without concern for word order within individual documents, so large documents will be extremely generalized in a DTM. It could well be reasonable to break up books, for example, by chapter. We could go finer grained as well – chunking by paragraph might make sense sometimes, too. Much will depend on the corpus and the object of analysis. Experiment and see what leads to the most understandable and coherent topics.
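As a sketch of what chunking could look like in the tidy format we are using (the 500-word chunk size and the doc_id naming are illustrative choices, not part of the original pipeline):

```r
library(tidyverse)

# Split each year's speech into consecutive 500-word chunks and
# build a new document id from the year and the chunk number.
chunk_size <- 500
nobel_chunked <- nobel_tidy %>%
  group_by(Year) %>%
  mutate(chunk = (row_number() - 1) %/% chunk_size) %>%
  ungroup() %>%
  unite(doc_id, Year, chunk, sep = "_")

# The DTM would then be cast with doc_id instead of Year:
# nobel_chunked %>% count(doc_id, words) %>% cast_dtm(doc_id, words, n)
```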
We are also using the corpus that we have already cleaned and removed stopwords from. We might further ask whether certain words turn up so often in every document that they won’t add anything to the topics the model finds (removing very frequent words also reduces the time it takes for the algorithm to fit the model). For instance, will the word “nobel” add anything to any of the topics in the Nobel corpus, especially if we treat whole speeches as our documents? It might or might not – topic models take some experimentation.
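If we did decide to drop such corpus-wide words, one place to do it is in the tidy data frame before casting the DTM. A sketch, with “nobel” standing in for whatever words your own experimentation flags:

```r
library(tidyverse)
library(tidytext)

# Remove words we suspect appear in (nearly) every document and
# therefore add little to any topic, then re-cast the DTM.
too_common <- c("nobel")
nobel_dtm_filtered <- nobel_tidy %>%
  filter(!words %in% too_common) %>%
  count(Year, words) %>%
  cast_dtm(Year, words, n)
```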
Lastly, the alpha parameter controls whether documents come to be dominated by one or a few topics or whether the topics are spread more evenly over documents. The algorithm optimizes this parameter automatically if the user does not set it, but algorithmic optimization does not always lead to the best model fit from the standpoint of a human. Left to itself, this model tends toward a low alpha and a very uneven topic spread, so we’ll set alpha ourselves. Again, this is something the analyst must experiment with.
k <- 15
alpha <- 2

# alpha is passed to LDA() via the control list
nobel_tm <- LDA(nobel_dtm, k = k, control = list(alpha = alpha))
Fitting the model means having R find the distributions that best match the corpus, given the general structural assumptions the topic model makes. There are different methods for doing this, and fitting might take a while. We are interested in two distributions: theta (\(\theta\)) – the proportion of each document devoted to each topic – and beta (\(\beta\)) – the proportion of each topic made up by each word (see the presentation pdf for details).
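Put a little more formally, the model assumes each word of a document is generated by first drawing a topic from that document’s \(\theta\) and then a word from that topic’s \(\beta\), so the probability of seeing word \(w\) in document \(d\) is

\[ p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k} \, \beta_{k,w} \]

where \(K\) is the number of topics we set above.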
Let’s first take a look at the output of the topic model. We call
posterior() to get these so-called posterior
distributions.
str(posterior(nobel_tm))
## List of 2
## $ terms : num [1:15, 1:8363] 3.32e-04 1.04e-04 1.27e-84 5.20e-219 1.19e-84 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:15] "1" "2" "3" "4" ...
## .. ..$ : chr [1:8363] "refuge" "nuclear" "weapon" "war" ...
## $ topics: num [1:92, 1:15] 6.33e-06 9.01e-06 9.12e-06 6.89e-06 6.89e-06 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:92] "1981" "2017" "1954" "1925" ...
## .. ..$ : chr [1:15] "1" "2" "3" "4" ...
Calling str() on this object shows that
topicmodels has returned two distributions. One, called
terms, is a matrix with the fifteen topics on
one axis and the 8363 unique words in the corpus on the other, each
entry indicating the likelihood of that word turning up given the topic (we
might think of this as the proportion of the topic taken up by each word
in the corpus). It is a probability distribution, so the word
probabilities within a given topic sum to 1. This is the
beta matrix. The topics distribution we see is a
92 x 15 matrix giving the likelihood of each document (speech) containing each
of the 15 topics – also summing to 1 within each document, and again
something we might think of as proportions. So what do we do with this?
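A quick sanity check on these claims, using the fitted nobel_tm from above: since both matrices are probability distributions, each row should sum to (approximately) 1.

```r
post <- posterior(nobel_tm)

# Each topic's word probabilities sum to 1 ...
rowSums(post$terms)   # one value per topic
# ... and each document's topic proportions sum to 1.
rowSums(post$topics)  # one value per document
```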
The most useful thing to look at straight away is the top words in each topic – do the topics make sense to a human?
terms(nobel_tm, 15)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
## [1,] "peac" "peac" "peac" "right" "peac" "peac"
## [2,] "women" "war" "nuclear" "countri" "peopl" "war"
## [3,] "prize" "nation" "world" "peac" "war" "intern"
## [4,] "world" "world" "war" "human" "violenc" "committe"
## [5,] "nobel" "intern" "nation" "world" "struggl" "human"
## [6,] "countri" "leagu" "intern" "nation" "world" "world"
## [7,] "war" "organ" "pearson" "declar" "conflict" "prize"
## [8,] "right" "time" "disarma" "war" "nobel" "cross"
## [9,] "human" "countri" "lester" "ilo" "northern" "nation"
## [10,] "wheat" "marshal" "time" "peopl" "polit" "red"
## [11,] "time" "american" "committe" "develop" "live" "prison"
## [12,] "peopl" "peopl" "arm" "nobel" "prize" "organ"
## [13,] "develop" "unit" "weapon" "prize" "human" "countri"
## [14,] "norwegian" "prize" "power" "intern" "luther" "nobel"
## [15,] "award" "presid" "unit" "organ" "dalai" "right"
## Topic 7 Topic 8 Topic 9 Topic 10 Topic 11 Topic 12
## [1,] "refuge" "nuclear" "nation" "peac" "peac" "nation"
## [2,] "countri" "weapon" "unit" "intern" "human" "time"
## [3,] "nation" "peac" "peac" "american" "prize" "war"
## [4,] "peopl" "war" "weapon" "confer" "nobel" "leagu"
## [5,] "world" "world" "nuclear" "nation" "right" "peac"
## [6,] "offic" "nobel" "paul" "presid" "peopl" "world"
## [7,] "peac" "prize" "forc" "law" "world" "unit"
## [8,] "war" "intern" "test" "countri" "south" "polit"
## [9,] "prize" "award" "world" "plan" "polit" "govern"
## [10,] "govern" "nation" "chemic" "america" "committe" "noel"
## [11,] "unicef" "committe" "war" "war" "award" "intern"
## [12,] "organ" "presid" "linu" "pact" "time" "baker"
## [13,] "unit" "iaea" "time" "organ" "regim" "confer"
## [14,] "award" "disarma" "congo" "polit" "laureat" "nansen"
## [15,] "nobel"    "countri"  "hammarskjöld" "lama"     "violenc"  "life"
## Topic 13 Topic 14 Topic 15
## [1,] "peac" "war" "peac"
## [2,] "war" "nation" "prize"
## [3,] "world" "peac" "nobel"
## [4,] "prize" "world" "peopl"
## [5,] "nation" "countri" "presid"
## [6,] "life" "leagu" "countri"
## [7,] "peopl" "europ" "award"
## [8,] "countri" "peopl" "norwegian"
## [9,] "nobel" "time" "committe"
## [10,] "brandt" "germani" "world"
## [11,] "willi" "quaker" "carter"
## [12,] "europ" "polici" "war"
## [13,] "award" "polit" "nation"
## [14,] "govern" "father" "parti"
## [15,] "polici" "organ" "ahtisaari"
We can, of course, work directly with these data structures but per our approach in this workshop, we’re going to tidy our results and take the data interpretation and visualization back to the tidyverse where we have all its tools at our disposal.
Let’s first plot the top words in each topic. This is generally where
you want to start in evaluating a topic model – are the topics
interpretable? We use tidy() to transform the beta matrix
into tidy format (one word per row), and then it is a simple task for us
to plot it in ggplot.
terms <- tidy(nobel_tm, matrix = "beta")

words_in_topics <- terms %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)

words_in_topics %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()
Let’s turn to the matrix of probabilities of topics over documents.
To keep us on our toes, topicmodels calls this not theta but
gamma (\(\gamma\)).
topics_in_documents <- tidy(nobel_tm, matrix = "gamma")
topics_in_documents
## # A tibble: 1,380 x 3
## document topic gamma
## <chr> <int> <dbl>
## 1 1981 1 0.00000633
## 2 2017 1 0.00000901
## 3 1954 1 0.00000912
## 4 1925 1 0.00000689
## 5 1926 1 0.00000689
## 6 1968 1 0.00000769
## 7 2013 1 0.0000129
## 8 1988 1 0.0000110
## 9 1953 1 0.00000566
## 10 2016 1 0.0000111
## # ... with 1,370 more rows
## # i Use `print(n = ...)` to see more rows
This tells us the estimated proportion of words in each document devoted to (generated by) a specific topic. A problem here is that topics identified only by number are hard to interpret, so we can first rename them. We can do this by hand (recommended) or automatically, based on the highest-ranking words in the beta matrix from before.
# labelling by hand; we would extend this to 1:15 if we wanted to name all 15 topics
#hand_topics <- tibble(old_topic = 1:3, new_topic = c("International peace", "Nuclear", "Peace and war"))
#topics_in_documents %>%
#  left_join(hand_topics, by = c("topic" = "old_topic"))
# alternative two, easier for demonstration purposes on a sub-optimally-fit topic model
(auto_topics <- apply(terms(nobel_tm, 3), 2, paste, collapse = "-")) # pastes together the top three terms for each topic in the nobel topic model
## Topic 1 Topic 2 Topic 3
## "peac-women-prize" "peac-war-nation" "peac-nuclear-world"
## Topic 4 Topic 5 Topic 6
## "right-countri-peac" "peac-peopl-war" "peac-war-intern"
## Topic 7 Topic 8 Topic 9
## "refuge-countri-nation" "nuclear-weapon-peac" "nation-unit-peac"
## Topic 10 Topic 11 Topic 12
## "peac-intern-american" "peac-human-prize" "nation-time-war"
## Topic 13 Topic 14 Topic 15
## "peac-war-world" "war-nation-peac" "peac-prize-nobel"
(auto_topics <- tibble(old_topic = 1:k, new_topic = auto_topics)) # make as tibble where numeric topics are matched with the auto generated ones
## # A tibble: 15 x 2
## old_topic new_topic
## <int> <chr>
## 1 1 peac-women-prize
## 2 2 peac-war-nation
## 3 3 peac-nuclear-world
## 4 4 right-countri-peac
## 5 5 peac-peopl-war
## 6 6 peac-war-intern
## 7 7 refuge-countri-nation
## 8 8 nuclear-weapon-peac
## 9 9 nation-unit-peac
## 10 10 peac-intern-american
## 11 11 peac-human-prize
## 12 12 nation-time-war
## 13 13 peac-war-world
## 14 14 war-nation-peac
## 15 15 peac-prize-nobel
(topics <- topics_in_documents %>%
  left_join(auto_topics, by = c("topic" = "old_topic")))
## # A tibble: 1,380 x 4
## document topic gamma new_topic
## <chr> <int> <dbl> <chr>
## 1 1981 1 0.00000633 peac-women-prize
## 2 2017 1 0.00000901 peac-women-prize
## 3 1954 1 0.00000912 peac-women-prize
## 4 1925 1 0.00000689 peac-women-prize
## 5 1926 1 0.00000689 peac-women-prize
## 6 1968 1 0.00000769 peac-women-prize
## 7 2013 1 0.0000129 peac-women-prize
## 8 1988 1 0.0000110 peac-women-prize
## 9 1953 1 0.00000566 peac-women-prize
## 10 2016 1 0.0000111 peac-women-prize
## # ... with 1,370 more rows
## # i Use `print(n = ...)` to see more rows
Now that we have our data in a familiar format we can subset and visualize. Perhaps we’d like to compare the topic distribution in several documents.
topics %>%
  filter(document %in% c(1977, 1985, 1996)) %>% # the documents we want to compare
  ggplot(aes(new_topic, gamma, fill = document)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ document, ncol = 3)
We can visualize the distribution of all topics over time.
topics %>%
  ggplot(aes(document, gamma)) +
  geom_col(aes(group = new_topic, fill = new_topic)) +
  scale_x_discrete(breaks = seq(1905, 2019, 10))
Or look at the distribution of specific topics over time.
# This one requires a more balanced topic mixture to be very meaningful,
# which the Nobel corpus with its current fit does not have
topics %>%
  filter(str_detect(new_topic, "war")) %>%
  ggplot(aes(document, gamma)) +
  geom_line(aes(group = new_topic, color = new_topic)) +
  scale_x_discrete(breaks = seq(1905, 2019, 10))
There are several packages in R that fit topic models, most notably
stm, which incorporates a host of handy visualization
tools as well as the capacity to incorporate covariates into the model
fit (@roberts2019stm).
This is only the most basic of introductions to topic modeling. For more information on topic modeling and analysis in the tidyverse, see chapter 6 of @silge2017text.
For a good explainer on topic models, see @underwood2012.